Abstract: The problem of joint universal source coding and 
modeling, treated in the context of lossless codes by Rissanen, 
was recently generalized to fixed-rate lossy coding of finitely 
parametrized continuous-alphabet i.i.d. sources. We extend these 
results to variable-rate lossy block coding of stationary ergodic 
sources and show that, for bounded metric distortion measures, 
any finitely parametrized family of stationary sources satisfying 
suitable mixing, smoothness and Vapnik-Chervonenkis learnabil- 
ity conditions admits universal schemes for joint lossy source 
coding and identification. We also give several explicit examples 
of parametric sources satisfying the regularity conditions. 

I. Introduction 

A universal source coding scheme is one that performs 
asymptotically optimally for all sources within a given class. 
Intuition suggests that a good universal coder should acquire 
a probabilistic model of the source from a sufficiently long 
data sequence and operate based on this model. For lossless 
codes, this intuition has been made rigorous by Rissanen [1]: 
the data are encoded via a two-part code which comprises 
(1) a suitably quantized maximum-likelihood estimate of the 
source parameters, and (2) an encoding of the data with the 
code optimized for the acquired model. The redundancy of 
this scheme converges to zero as k log n/n, where n is the 
block length and k is the dimension of the parameter space. 

Recently we have extended Rissanen's ideas to lossy 
block coding of finitely parametrized continuous-alphabet i.i.d. 
sources with bounded parameter spaces [2], [3]. We have 
shown that, under appropriate regularity conditions, there exist 
joint universal schemes for lossy coding and source identification
whose distortion redundancy and source estimation
fidelity both converge to zero as $O(\sqrt{\log n/n})$ as the block
length $n$ tends to infinity. The code operates by coding each
block with the code matched to the parameters estimated from
the preceding block. Moreover, the constant hidden in the $O(\cdot)$
notation increases with the "richness" of the model class, as
measured by the Vapnik-Chervonenkis (VC) dimension [4],
[5] of a certain class of decision regions in the source alphabet.

The main limitation of the results of [2], [3] is the i.i.d. 
assumption, which excludes such practically relevant model 
classes as autoregressive sources or Markov and hidden 
Markov processes. Furthermore, the assumption of a bounded
parameter space may not always be justified. In this paper
we relax both of these assumptions. Because the parameter
space is not bounded, we have to use variable-rate codes with
countably infinite codebooks, whose performance is naturally
quantified by Lagrangians [6], [7]. We show that, under certain
regularity conditions, there are universal schemes for joint
lossy source coding and modeling such that, as the block
length $n$ tends to infinity, both the Lagrangian redundancy
relative to the best variable-rate code at each block length and
the source estimation fidelity at the decoder converge to zero as
$O(\sqrt{V_n \log n/n})$, where $V_n$ is the VC dimension of a certain
class of decision regions induced by the collection of all
$n$-dimensional marginals of the source process distributions.

The key novel feature of our scheme is that, unlike most 
existing schemes for universal lossy coding, which rely on 
implicit identification of the active source, it learns an explicit 
probabilistic model. Moreover, our results clearly show that
the "price of universality" of a modeling-based compression
scheme grows with the combinatorial richness of the
underlying model class, as captured by the VC dimension
sequence $\{V_n\}$. The richer the model class, the harder it is
to learn, which in turn affects the compression performance 
because we use the source parameters learned from past data in 
deciding how to encode the current block. These insights may 
prove useful in such settings as digital forensics or adaptive 
control under communication constraints, where trade-offs 
between the quality of parameter estimation and compression 
performance are of central importance. 

II. Preliminaries 

Let $X = \{X_i\}_{i \in \mathbb{Z}}$ be a stationary, ergodic source with
alphabet $\mathcal{X}$. All alphabets are assumed to be Polish spaces
equipped with their Borel $\sigma$-fields. We adopt the usual setting
of universal source coding: the process distribution of $X$ is not
known exactly, apart from being a member of some indexed
class $\{P_\theta : \theta \in \Lambda\}$. We assume that the parameter space $\Lambda$ is
an open subset of $\mathbb{R}^k$ with nonempty interior. We also assume
that there exists a $\sigma$-finite measure $\mu$ on $\mathcal{X}$, such that for every
$\theta \in \Lambda$ the $n$-dimensional marginals $P_\theta^n$ of $P_\theta$ are absolutely
continuous with respect to (w.r.t.) the product measure $\mu^n$, for
all $n$, denoting the corresponding densities $dP_\theta^n/d\mu^n$ by $p_\theta$.

We wish to code $X$ into a reproduction process $\hat{X} = \{\hat{X}_i\}_{i \in \mathbb{Z}}$
with alphabet $\hat{\mathcal{X}}$ by means of a finite-memory
variable-rate lossy block code (vector quantizer). Such a code
with block length $n$ and memory length $m$ [an $(n,m)$-block
code, for short] is a pair $C^{n,m} = (f, \varphi)$, where
$f : \mathcal{X}^n \times \mathcal{X}^m \to \mathcal{S}$ is the encoder,
$\varphi : \mathcal{S} \to \hat{\mathcal{X}}^n$ is the decoder, and
$\mathcal{S} \subset \{0,1\}^*$ is a finite or countable collection of binary strings
satisfying the prefix condition. The mapping of $X$ into $\hat{X}$ is
defined by

$$\hat{X}_{in+1}^{(i+1)n} = \varphi\big(f(X_{in+1}^{(i+1)n}, X_{in-m+1}^{in})\big), \qquad i \in \mathbb{Z},$$

where $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$, $i \le j$. Thus, the encoding
is done in blocks of length $n$, but the encoder is also allowed to
view the $m$ source symbols immediately preceding the current
$n$-block. Abusing notation, we shall denote by $C^{n,m}$ both the
composition $\varphi \circ f$ and the pair $(f, \varphi)$; when $m = 0$, we shall
use a more compact notation $C^n$ and say "$n$-block code."

Let $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ be a measurable single-letter distortion
function; $\rho_n(x^n, \hat{x}^n) = n^{-1} \sum_{i=1}^n \rho(x_i, \hat{x}_i)$ is the per-letter
distortion due to reproducing $x^n \in \mathcal{X}^n$ by $\hat{x}^n \in \hat{\mathcal{X}}^n$. We assume
that $\rho$ is a metric on $\mathcal{X} \cup \hat{\mathcal{X}}$, bounded from above by some
$\rho_{\max} < \infty$. Suppose $X \sim P_\theta$. Associated with the code $C^{n,m}$
are its expected distortion $D_\theta(C^{n,m}) = E_\theta\{\rho_n(X_1^n, \hat{X}_1^n)\}$
and its expected rate $R_\theta(C^{n,m}) = E_\theta\{\ell_n(f(X_1^n, X_{-m+1}^0))\}$,
where, for a binary string $s$, $\ell_n(s)$ is its length in bits,
normalized by $n$. When working with variable-rate quantizers,
it is convenient [6], [7] to absorb the distortion and the rate into
a single performance measure, the Lagrangian

$$L_\theta(C^{n,m}, \lambda) = D_\theta(C^{n,m}) + \lambda R_\theta(C^{n,m}),$$

where $\lambda > 0$ is the Lagrange multiplier which controls the
distortion-rate trade-off. The optimal Lagrangian performance
achievable on $P_\theta$ by any zero-memory variable-rate quantizer
with block length $n$ is given by the $n$th-order operational
distortion-rate Lagrangian $L_\theta^n(\lambda) = \inf_{C^n} L_\theta(C^n, \lambda)$ [6].
Allowing the codes to have nonzero
memory does not improve optimal performance, because we
can use memoryless nearest-neighbor encoders to convert any
$(n,m)$-block code into an $n$-block code without increasing the
Lagrangian. Thus, $L_\theta^n(\lambda) = \inf_m \inf_{C^{n,m}} L_\theta(C^{n,m}, \lambda)$, where
the infimum is over all memory lengths $m$ and all $(n,m)$-block
codes $C^{n,m}$, for a fixed block length $n$. Because each $P_\theta$ is
ergodic, $L_\theta^n(\lambda)$ converges, as $n \to \infty$, to the distortion-rate
Lagrangian $L_\theta(\lambda) = \min_R \big(D_\theta(R) + \lambda R\big)$, where $D_\theta(R)$ is
the Shannon distortion-rate function of $P_\theta$ [6].
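
As a rough illustration of the Lagrangian trade-off between distortion and rate, the following Python sketch encodes one block by picking, from a small codebook, the codeword that minimizes per-letter distortion plus $\lambda$ times normalized description length. The codebook, its binary strings, and the absolute-error distortion are hypothetical choices made only for this example; they are not part of the construction in this paper.

```python
def per_letter_distortion(x, x_hat):
    """rho_n: average of |x_i - x_hat_i| over the block (a metric distortion)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def lagrangian_encode(x, codebook, lam):
    """Return (binary string, reproduction, cost) minimizing rho_n + lam * ell_n.

    codebook maps prefix-free binary strings to reproduction blocks;
    ell_n(s) = len(s) / n is the description length normalized by n.
    """
    n = len(x)
    best = None
    for s, x_hat in codebook.items():
        cost = per_letter_distortion(x, x_hat) + lam * len(s) / n
        if best is None or cost < best[2]:
            best = (s, x_hat, cost)
    return best


# Hypothetical prefix-free codebook for blocks of length n = 2.
CODEBOOK = {
    "0": (0.0, 0.0),
    "10": (1.0, 1.0),
    "110": (0.0, 1.0),
    "111": (1.0, 0.0),
}
```

With a small $\lambda$ the encoder favors the closest (but longer) codeword; with a large $\lambda$ it falls back to the shortest description, which is the trade-off the multiplier is meant to control.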

III. The results 

In this section we state our result on universal schemes for 
joint lossy compression and identification of stationary sources 
satisfying certain regularity conditions. We wish to design 
a sequence of variable-rate vector quantizers, such that the 
decoder can reliably reconstruct the source sequence X and 
reliably identify the active source in an asymptotically optimal 
manner for all $\theta \in \Lambda$. The identification performance will be
judged in terms of the variational distance, which for any two
probability measures $P$, $Q$ on a measurable space $(Z, \mathcal{A})$ is
defined by $d(P,Q) = 2 \sup_{A \in \mathcal{A}} |P(A) - Q(A)|$. Denoting by
$p$ and $q$ the respective densities of $P$ and $Q$ w.r.t. a dominating
measure $\nu$, we can also write $d(P,Q) = \int_Z |p(z) - q(z)|\,d\nu(z)$.
The set of all $Q$ satisfying $d(P,Q) < \delta$ for a given $P$ is called
the variational ball of radius $\delta$ around $P$.
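
The equivalence of the two expressions for the variational distance can be checked numerically on a finite alphabet. The sketch below is only an illustration: the distributions are represented as Python dictionaries over a common finite set (an assumption of the example), and the supremum over events is computed by brute force over all subsets.

```python
from itertools import combinations


def variational_distance_l1(p, q):
    """Sum over z of |p(z) - q(z)| for pmfs given as dicts with the same keys."""
    return sum(abs(p[z] - q[z]) for z in p)


def variational_distance_sup(p, q):
    """2 * max over all subsets A of |P(A) - Q(A)|, by brute force."""
    points = list(p)
    best = 0.0
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            gap = abs(sum(p[z] for z in subset) - sum(q[z] for z in subset))
            best = max(best, gap)
    return 2.0 * best


# Two hypothetical distributions on a three-letter alphabet.
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.3, "c": 0.5}
```

The brute-force version is exponential in the alphabet size and is meant only to make the definition concrete.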

Our first condition ensures that each source in the class is 
sufficiently close to an i.i.d. source, in an asymptotic sense. 
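Define the $k$th $\beta$-mixing coefficient of $P_\theta$ [5] by

$$\beta_\theta(k) \triangleq 2 \sup_{A \in \sigma(X_{-\infty}^0, X_k^\infty)} \left| P_\theta(A) - P_\theta^- \times P_\theta^+(A) \right|,$$

where $\sigma(X_{-\infty}^0, X_k^\infty)$ is the $\sigma$-field generated by $\{X_i\}_{i \le 0}$
and $\{X_i\}_{i \ge k}$, and $P_\theta^-$ and $P_\theta^+$ are the marginal distributions
of $\{X_i\}_{i \le 0}$ and $\{X_i\}_{i > 0}$, respectively. An i.i.d. source has
$\beta(k) = 0$, $\forall k$; if $\beta(k) \to 0$ as $k \to \infty$, the source is called $\beta$-mixing.

Condition 1. The sources in $\{P_\theta : \theta \in \Lambda\}$ are algebraically
$\beta$-mixing:

$$\exists r > 0 \text{ such that } \beta_\theta(k) = O(k^{-r}), \ \forall \theta \in \Lambda.$$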

The second condition ensures that the parametrization of 
the sources is sufficiently smooth. 

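Condition 2. Let $d_n(\theta,\theta')$ denote the variational distance
between $P_\theta^n$ and $P_{\theta'}^n$. Then for every $\theta \in \Lambda$,

$$\exists \delta_\theta, c_\theta > 0 \text{ such that } \sup_n \frac{d_n(\theta,\theta')}{\sqrt{n}} \le c_\theta \|\theta - \theta'\|$$

for all $\theta'$ satisfying $\|\theta' - \theta\| < \delta_\theta$, where $\|\cdot\|$ denotes the
Euclidean norm on $\mathbb{R}^k$.

This condition is met, for instance, if the asymptotic Fisher
information matrix $I(\theta)$ exists for all $\theta \in \Lambda$ (under some
technical assumptions on the densities $p_\theta$). It guarantees that,
for every sequence $\{\delta_n\}_{n \in \mathbb{N}}$ of positive reals satisfying $\delta_n \to 0$,
$\sqrt{n}\delta_n \to 0$ as $n \to \infty$, and for every sequence $\{\theta_n\}_{n \in \mathbb{N}}$
in $\Lambda$ satisfying $\|\theta_n - \theta\| < \delta_n$ for a given $\theta \in \Lambda$, we have
$d_n(\theta_n, \theta) \to 0$ as $n \to \infty$.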
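
The last claim follows in one line from Condition 2. Spelled out, with $\delta_\theta$ and $c_\theta$ the constants of Condition 2 and $n$ large enough that $\delta_n < \delta_\theta$:

```latex
d_n(\theta_n,\theta)
  \;\le\; c_\theta \sqrt{n}\,\|\theta_n - \theta\|
  \;<\; c_\theta \sqrt{n}\,\delta_n
  \;\longrightarrow\; 0
  \qquad (n \to \infty).
```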

Finally, we impose a learnability condition. To state it we 
need some facts on Vapnik-Chervonenkis classes (see, e.g., 
[4], [5]). Let $(Z, \mathcal{A})$ be a measurable space. Given a collection
$\mathcal{C}$ of measurable subsets of $Z$, its Vapnik-Chervonenkis (VC)
dimension $V(\mathcal{C})$ is defined as the largest integer $n$ for which

$$\max_{x^n \in Z^n} \left| \left\{ (1_{\{x_1 \in A\}}, \ldots, 1_{\{x_n \in A\}}) : A \in \mathcal{C} \right\} \right| = 2^n; \qquad (1)$$

if (1) holds for all $n$, then $V(\mathcal{C}) = \infty$. If $V(\mathcal{C}) < \infty$, we say
that $\mathcal{C}$ is a VC class. The Vapnik-Chervonenkis inequalities are
finite-sample bounds on uniform deviations of probabilities of
events in a VC class from their relative frequencies: if $X^n =
(X_1, \ldots, X_n)$ is an i.i.d. sample from a probability measure
$P$ on $(Z, \mathcal{A})$, and if $\mathcal{C}$ is a VC class with $V(\mathcal{C}) > 2$, then

$$\mathbb{P}\Big\{ \sup_{A \in \mathcal{C}} |P_{X^n}(A) - P(A)| > \epsilon \Big\} \le 8 n^{V(\mathcal{C})} e^{-n\epsilon^2/32}, \qquad \forall \epsilon > 0$$

and

$$\mathbb{E}\Big\{ \sup_{A \in \mathcal{C}} |P_{X^n}(A) - P(A)| \Big\} \le c\sqrt{V(\mathcal{C}) \log n/n},$$

where $c > 0$ is a universal constant,$^1$ $P_{X^n}$ is the empirical
distribution of $X^n$, and the probabilities and expectations are
w.r.t. the product measure $P^n$ on $(Z^n, \mathcal{A}^n)$.
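
To make the definition of the VC dimension concrete, here is a small brute-force sketch that counts the indicator patterns a class induces on finite point sets. It works only on a finite ground set, so it returns the VC dimension relative to that set; the two classes used (half-lines and intervals on six points) are hypothetical examples, not classes used in this paper.

```python
from itertools import combinations


def traces(points, sets):
    """Distinct patterns (1{x_1 in A}, ..., 1{x_n in A}) as A ranges over the class."""
    return {tuple(x in A for x in points) for A in sets}


def vc_dimension(ground, sets):
    """Largest n such that some n points of `ground` are shattered by `sets`."""
    best = 0
    for n in range(1, len(ground) + 1):
        if any(len(traces(pts, sets)) == 2 ** n for pts in combinations(ground, n)):
            best = n
        else:
            break  # if no n-subset is shattered, no larger subset can be
    return best


GROUND = list(range(6))
# Half-lines {x : x <= t}, including the empty set (t = -1).
HALF_LINES = [frozenset(x for x in GROUND if x <= t) for t in range(-1, 6)]
# Intervals {x : a <= x <= b}, plus the empty set.
INTERVALS = [frozenset(range(a, b + 1)) for a in GROUND for b in GROUND if a <= b]
INTERVALS.append(frozenset())
```

Half-lines cannot produce the pattern (0, 1) on two ordered points, and intervals cannot produce (1, 0, 1) on three, which is why the search stops at 1 and 2 respectively.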

Condition 3. For $n \in \mathbb{N}$, let $\mathcal{A}_n$ consist of all sets of the form

$$A_{\theta,\theta'} = \{x^n \in \mathcal{X}^n : p_\theta(x^n) > p_{\theta'}(x^n)\}, \qquad \theta \ne \theta'$$

($\mathcal{A}_n$ is the so-called Yatracos class defined by $\{p_\theta\}$, see [4]
and references therein). Then we require that each $\mathcal{A}_n$ is a VC
class, $V_n = V(\mathcal{A}_n) < \infty$, and that $V_n = o(n/\log n)$.

Theorem 1. Suppose Conditions 1-3 are satisfied. Then for
every $\lambda, \eta > 0$ there exists a sequence $\{C_*^{n,m_n}\}_{n \in \mathbb{N}}$ of
variable-rate vector quantizers with memory lengths
$m_n = n(n + \lceil n^{(2+\eta)/r} \rceil)$, such that

$$L_\theta(C_*^{n,m_n}, \lambda) - \inf_m \inf_{C^{n,m}} L_\theta(C^{n,m}, \lambda) = O\left( \sqrt{\frac{V_n \log n}{n}} \right)$$

for all $\theta \in \Lambda$. Moreover, for each $n$, the binary description
produced by the encoder is such that the decoder can identify
the $n$-dimensional marginal of the active source up to a
variational ball of radius $O(\sqrt{V_n \log n/n})$ almost surely.

1 Using more refined techniques, the $c\sqrt{V(\mathcal{C}) \log n/n}$ bound can be
improved to $c'\sqrt{V(\mathcal{C})/n}$, where $c'$ is another constant. However, $c'$ is much
larger than $c$, so any benefit of the new bound shows only for "impractically"
large values of $n$.

That is, for each $n$ and $\theta$, the code $C_*^{n,m_n}$, which is independent of
$\theta$, performs almost as well as the best finite-memory quantizer
with block length $n$ that can be designed with full knowledge
of $P_\theta$. Thus, as far as compression goes, our scheme can
compete with all finite-memory variable-rate quantizers, with
the additional bonus of allowing the decoder to identify the
active source in an asymptotically optimal manner. Recalling
the discussion of Lagrangian optimality in Section II, we see
that Theorem 1 immediately implies the following:

Corollary 2. The sequence $\{C_*^{n,m_n}\}_{n \in \mathbb{N}}$ is weakly minimax
universal$^2$ for $\{P_\theta : \theta \in \Lambda\}$, i.e., for every $\theta \in \Lambda$,

$$L_\theta(C_*^{n,m_n}, \lambda) \to L_\theta(\lambda) \quad \text{as } n \to \infty.$$

IV. The proof of Theorem 1

The main idea. It suffices to construct a universal scheme that
can compete with all zero-memory codes; that is, we need to
show that there exists a sequence $\{C_*^{n,m_n}\}$ of codes, such that
$L_\theta(C_*^{n,m_n}, \lambda) - L_\theta^n(\lambda) = O(\sqrt{V_n \log n/n})$ for all $\theta \in \Lambda$.

We assume throughout that the "true" source is $P_{\theta_0}$ for
some $\theta_0 \in \Lambda$. Our code operates as follows. Suppose that
both the encoder and the decoder have access to a countably
infinite "database" $\mathbf{c} = \{\theta(i)\}_{i \in \mathbb{N}} \subset \Lambda$. Using Elias' universal
representation of the integers [8], we can associate to each $\theta(i)$
a unique binary string $s(i)$ with $\ell(s(i)) = \log i + O(\log\log i)$
bits. Suppose also that for each $n$, $\theta$ there exists a zero-memory
$n$-block code $C_\theta^n = (f_\theta, \varphi_\theta)$ that achieves the $n$th-order
Lagrangian optimum for $P_\theta$: $L_\theta(C_\theta^n, \lambda) = L_\theta^n(\lambda)$. The
encoding of $X_1^n$ into $\hat{X}_1^n$ is done as follows:

1) The encoder estimates $P_{\theta_0}^n$ from the $m_n$-block $X_{-m_n+1}^0$
as $P_{\tilde\theta}^n$, where $\tilde\theta = \tilde\theta(X_{-m_n+1}^0)$.

2) The encoder then computes the waiting time

$$T_n = \inf\{i \ge 1 : d_n(\theta(i), \tilde\theta(X_{-m_n+1}^0)) < \sqrt{n}\delta_n\},$$

with the standard convention that the infimum of the
empty set is equal to $+\infty$; $\{\delta_n\}$ is a sequence of positive
reals to be specified later.

3) If $T_n < +\infty$, the encoder sets $\hat\theta = \theta(T_n)$; otherwise,
the encoder sets $\hat\theta = \theta(1)$ (or some other default $\theta$).

4) The description of $X_1^n$ is a concatenation of three binary
strings: (i) a 1-bit flag $b$ to tell whether $T_n$ is finite
($b = 0$) or infinite ($b = 1$); (ii) a binary string $s_1$ which
is equal to $s(T_n)$ if $T_n < +\infty$ or is empty if $T_n = +\infty$;
(iii) $s_2 = f_{\hat\theta}(X_1^n)$. The string $\tilde{s} = b s_1$ is the first-stage
description, while $s_2$ is the second-stage description.
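
The first-stage description built in steps 2) to 4) can be sketched in Python as follows. This is a minimal illustration under several assumptions: the database is truncated to a finite list, `dist` and `radius` are stand-ins for $d_n$ and $\sqrt{n}\delta_n$, and the Elias delta code is used as one concrete integer code whose length is $\log i + O(\log\log i)$ (the text only refers to Elias' universal representation, without naming a variant).

```python
def elias_delta(i):
    """Elias delta code of an integer i >= 1, as a string of '0'/'1'."""
    if i < 1:
        raise ValueError("Elias delta is defined for integers >= 1")
    n_bits = i.bit_length()              # floor(log2 i) + 1
    len_bits = n_bits.bit_length() - 1   # floor(log2 n_bits)
    # len_bits zeros, then n_bits in binary, then i without its leading 1.
    return "0" * len_bits + format(n_bits, "b") + format(i, "b")[1:]


def first_stage_description(theta_tilde, database, dist, radius):
    """Return (flag + index string, chosen parameter).

    database: finite list standing in for the countable database {theta(i)}.
    dist:     stand-in for d_n; radius: stand-in for sqrt(n) * delta_n.
    """
    for i, theta in enumerate(database, start=1):
        if dist(theta, theta_tilde) < radius:
            return "0" + elias_delta(i), theta   # waiting time T_n = i, flag b = 0
    return "1", database[0]                      # nothing found: flag b = 1, default theta(1)


def abs_dist(a, b):
    """Hypothetical scalar stand-in for the variational distance d_n."""
    return abs(a - b)
```

For example, with the database `[0.0, 2.0, 1.1, 0.9]`, estimate `1.0` and radius `0.2`, the first match is the third entry, so the description is the flag `0` followed by the delta code of 3.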

2 See [6] for other notions of universality for lossy codes. 
Fig. 1. The structure of the code $C_*^{n,m_n}$ [...]
for estimating the source parameters.
The decoder receives $b s_1 s_2$, determines $\hat\theta$ from $\tilde{s}$, and produces
$\hat{X}_1^n = \varphi_{\hat\theta}(s_2)$. If $b = 0$ (which, as we shall show, will happen
eventually a.s.), then $P_{\hat\theta}^n$ is in the variational ball of radius
$\sqrt{n}\delta_n$ around the estimated $P_{\tilde\theta}^n$. If the latter is a good estimate,
i.e., $d_n(\theta_0, \tilde\theta) \to 0$ as $n \to \infty$, then the decoder's estimate of
$P_{\theta_0}^n$ is only slightly worse. Moreover, the a.s. convergence of
$d_n(\theta_0, \hat\theta)$ to zero as $n \to \infty$ implies that the performance of
$C_{\hat\theta}^n$ on $P_{\theta_0}$ is close to the optimum $L_{\theta_0}(C_{\theta_0}^n, \lambda) = L_{\theta_0}^n(\lambda)$.

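Formally, the code $C_*^{n,m_n}$ comprises the following
maps: (1) the parameter estimator $\tilde\theta : \mathcal{X}^{m_n} \to \Lambda$; (2) the
parameter encoder $g : \Lambda \to \tilde{\mathcal{S}}$, where $\tilde{\mathcal{S}} = \{0s(i)\}_{i \in \mathbb{N}} \cup \{1\}$;
(3) the parameter decoder $\psi : \tilde{\mathcal{S}} \to \Lambda$. Let $\tilde{f}$ denote the
composition $g \circ \tilde\theta$ of the parameter estimator and the parameter
encoder, which we refer to as the first-stage encoder, and let $\hat\theta$
denote the composition $\psi \circ \tilde{f}$ of the parameter decoder and the
first-stage encoder. The decoder $\psi$ is the first-stage decoder.
The collection $\{C_\theta^n : \theta \in \Lambda\}$ defines the second-stage codes.
The encoder $f_* : \mathcal{X}^n \times \mathcal{X}^{m_n} \to \tilde{\mathcal{S}} \times \mathcal{S}$ and the decoder
$\varphi_* : \tilde{\mathcal{S}} \times \mathcal{S} \to \hat{\mathcal{X}}^n$ of $C_*^{n,m_n}$ are defined as
$f_*(x^n, x_{-m_n+1}^0) = \tilde{f}(x_{-m_n+1}^0)\, f_{\hat\theta(x_{-m_n+1}^0)}(x^n)$
and $\varphi_*(\tilde{s} s) = \varphi_{\psi(\tilde{s})}(s)$ for all
$\tilde{s} \in \tilde{\mathcal{S}}$, $s \in \mathcal{S}$, respectively.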

To assess the performance of the code, introduce the functions

$$g(x^n, y^{m_n}) = \rho_n\big(x^n, C^n_{\hat\theta(y^{m_n})}(x^n)\big) + \lambda \ell_n\big(f_{\hat\theta(y^{m_n})}(x^n)\big)$$

and $h(y^{m_n}) = \ell_n(\tilde{f}(y^{m_n}))$. Then $h(X_{-m_n+1}^0)$ is the
normalized length of the first-stage description, while
$g(X_1^n, X_{-m_n+1}^0)$ is the instantaneous Lagrangian performance
of the corresponding second-stage code. The expected
Lagrangian performance of our code is

$$L_{\theta_0}(C_*^{n,m_n}, \lambda) = E_{\theta_0} g(X_1^n, X_{-m_n+1}^0) + \lambda E_{\theta_0} h(X_{-m_n+1}^0).$$

We prove the theorem by showing that, with proper choices
for the memory length $m_n$, the "database" $\mathbf{c}$, the parameter
estimator $\tilde\theta$, and the sequence $\{\delta_n\}$, we can ensure
that $E_{\theta_0} h(X_{-m_n+1}^0) = O(k \log n/n) + O(\log\log n/n) + o(1)$,
$E_{\theta_0} g(X_1^n, X_{-m_n+1}^0) = L_{\theta_0}^n(\lambda) + O(\sqrt{V_n \log n/n})$, and
$d_n(\theta_0, \hat\theta(X_{-m_n+1}^0)) = O(\sqrt{V_n \log n/n})$ $P_{\theta_0}$-almost surely.
Step 1: choice of memory length. Let $l_n = \lceil n^{(2+\eta)/r} \rceil$ and
$m_n = n(n + l_n)$. Divide $X_{-m_n+1}^0$ into $n$ blocks $Z_1, \ldots, Z_n$
of length $n$ interleaved by $n$ blocks $Y_1, \ldots, Y_n$ of length $l_n$
(see Figure 1). The parameter estimator $\tilde\theta$, although defined as
acting on the entire $X_{-m_n+1}^0$, effectively will make use only
of $Z^n = (Z_1, \ldots, Z_n)$. Each $Z_j \sim P_{\theta_0}^n$, but the $Z_j$'s are not
independent. Let $Q^{(n)}$ denote the marginal distribution of $Z^n$,
and let $\tilde{Q}^{(n)}$ denote the product of $n$ copies of $P_{\theta_0}^n$. Using
induction and the definition of the $\beta$-mixing coefficient, we
can show that

$$d(Q^{(n)}, \tilde{Q}^{(n)}) \le (n-1)\beta_{\theta_0}(l_n) = O(1/n^{1+\eta}),$$

which follows from Condition 1 and our choice of $l_n$. This
"blocking technique" [9] allows us to approximate certain
probabilities and expectations w.r.t. $P_{\theta_0}$ by probabilities and
expectations w.r.t. suitably constructed i.i.d. processes.
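
The bookkeeping behind the blocking technique can be sketched as follows, taking the gap length to be $l_n = \lceil n^{(2+\eta)/r} \rceil$ and the memory length $m_n = n(n + l_n)$. The ordering of the blocks (each estimation block $Z_j$ followed by its gap block $Y_j$) is an assumption of this sketch; the text only says the two kinds of blocks are interleaved.

```python
import math


def block_lengths(n, r, eta):
    """Gap length l_n = ceil(n^((2+eta)/r)) and memory length m_n = n(n + l_n)."""
    l_n = math.ceil(n ** ((2 + eta) / r))
    return l_n, n * (n + l_n)


def split_blocks(past, n, l_n):
    """Split a past of length n(n + l_n) into estimation blocks Z_j and gaps Y_j."""
    if len(past) != n * (n + l_n):
        raise ValueError("past must have length n * (n + l_n)")
    z_blocks, y_blocks = [], []
    for j in range(n):
        start = j * (n + l_n)
        z_blocks.append(past[start:start + n])             # used by the estimator
        y_blocks.append(past[start + n:start + n + l_n])   # discarded gap
    return z_blocks, y_blocks
```

Only the `z_blocks` would be passed on to the parameter estimator; the gaps exist solely to make consecutive estimation blocks nearly independent.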
Step 2: construction of the database. We proceed by random 
selection. Let W be some probability measure on A with a 
positive, everywhere continuous density w(9). We generate 
$\mathbf{C} = \{\theta(i)\}_{i \in \mathbb{N}}$ as an i.i.d. sequence of vectors in $\Lambda$ drawn
according to $W$, independently of $X$.

Step 3: estimation of the active source. We use the Devroye-Lugosi
minimum-distance estimator (MDE) (see [4] and references
therein). Namely, given the estimation blocks $Z^n =
(Z_1, \ldots, Z_n)$, define

$$U_\theta(Z^n) = \sup_{A \in \mathcal{A}_n} |P_\theta^n(A) - P_{Z^n}(A)|$$

for every $\theta \in \Lambda$, where the supremum is over all sets in the
Yatracos class $\mathcal{A}_n$ and $P_{Z^n}$ is the empirical distribution on $\mathcal{X}^n$
induced by $Z^n$. Then $\tilde\theta(X_{-m_n+1}^0)$ is any $\theta^* \in \Lambda$ satisfying
$U_{\theta^*}(Z^n) < \inf_{\theta \in \Lambda} U_\theta(Z^n) + 1/n$ (the extra $1/n$ term ensures
that at least one such $\theta^*$ exists). Note that $\tilde\theta(X_{-m_n+1}^0)$ only
depends on $Z^n$. The key property of the MDE is [4]

$$d_n(\theta_0, \tilde\theta(X_{-m_n+1}^0)) \le 4U_{\theta_0}(Z^n) + 3/n, \qquad (2)$$

which holds regardless of whether $Z^n$ is i.i.d. or not.
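
A toy version of the minimum-distance estimator is sketched below. It is deliberately simplified: the candidate family is a finite list of probability mass functions on a common finite alphabet (so the infimum is attained and no $1/n$ slack is needed), and the observations are single symbols rather than $n$-blocks. These simplifications are assumptions of the example, not features of the estimator used in the proof.

```python
def yatracos_sets(candidates):
    """All sets {x : p_i(x) > p_j(x)}, i != j, for pmfs given as dicts."""
    sets = set()
    for i, p in enumerate(candidates):
        for j, q in enumerate(candidates):
            if i != j:
                sets.add(frozenset(x for x in p if p[x] > q[x]))
    return sets


def minimum_distance_estimate(candidates, sample):
    """Index of the candidate minimizing max_A |P(A) - empirical(A)| over Yatracos sets.

    Requires at least two distinct candidates so that the Yatracos class is nonempty.
    """
    sets = yatracos_sets(candidates)
    n = len(sample)

    def empirical(event):
        return sum(1 for z in sample if z in event) / n

    def deviation(p):
        return max(abs(sum(p[x] for x in event) - empirical(event)) for event in sets)

    return min(range(len(candidates)), key=lambda i: deviation(candidates[i]))


# Three hypothetical candidate sources on a binary alphabet.
CANDIDATES = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.5, "b": 0.5},
    {"a": 0.2, "b": 0.8},
]
```

The estimator never evaluates a likelihood: it only compares candidate probabilities with relative frequencies on the finitely many Yatracos sets, which is what makes the VC inequalities applicable.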
Step 4: expected first-stage description length. We follow 
the ideas of [10]. Let us assume that the sequence {5 n } is 
such that 6 n — > as n — > oo. Define the event F n — 
{9 E A : d n (9,9(X°_ mn+1 )) < y/n~6 n } and note that if 
q„ = W{F n \X° mn+1 = x°_ mn+1 ) > 0, then the waiting 
time T n is a geometric random variable with parameter q n . 
Condition 2 ensures that, in fact, q n > for n sufficiently 
large, for Pg -almost all realizations of X. Using the Borel- 
Cantelli lemma, it is not hard to show that Eg log T n < 
log log n + 2 — Eg log q n for all realizations of C, eventually 
Pe -a.s. We now lower-bound q n for large n. Using the triangle 
inequality, independence of X and C, Condition 2 and the fact 
that 5 n — > as n — > oo, we have, for n sufficiently large, 

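$$q_n \ge W\big(\|\Theta - \theta_0\| < \delta_n/2c_{\theta_0}\big)\, P_{\theta_0}\big(d_n(\theta_0, \tilde\theta) < \sqrt{n}\delta_n/2\big),$$

where $\tilde\theta = \tilde\theta(X_{-m_n+1}^0)$ and $\Theta \sim W$. Via simple
volume bounding, $W(\|\Theta - \theta_0\| < \delta_n/2c_{\theta_0}) \ge
(1/2)w(\theta_0)v_k(\delta_n/2c_{\theta_0})^k$ for $n$ sufficiently large, where $v_k$ is
the volume of the unit ball in $\mathbb{R}^k$. Next, we use blocking
to approximate $P_{\theta_0}$-probabilities by $\tilde{Q}^{(n)}$-probabilities, and
then invoke the property (2) of the MDE and the Vapnik-Chervonenkis
inequalities to obtain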

$$P_{\theta_0}\big(d_n(\theta_0, \tilde\theta(X_{-m_n+1}^0)) < \sqrt{n}\delta_n/2\big) \ge 1 - 8n^{V(\mathcal{A}_n)} e^{-n(\sqrt{n}\delta_n - 6/n)^2/2048} - O(1/n^{1+\eta}).$$

Choosing $\delta_n = \sqrt{2048(V_n + 1)\ln n}/n + 6/n^{3/2}$, we get for the
normalized expected first-stage description length$^3$

$$E_{\theta_0} h(X_{-m_n+1}^0) = O(k \log n/n) + O(\log\log n/n) + o(1).$$

The sequence $\delta_n$ indeed converges to 0 owing to Condition 3.

3 Note that, up to a constant, the first term on the right-hand side has the
same form as in Rissanen [1]; additional terms are due to the unboundedness
of $\Lambda$ and the fact that the points $\theta(i)$ do not form a regular grid.

Step 5: expected second-stage Lagrangian performance. Using
the fact that the distortion measure $\rho$ is bounded, one can show
via an argument similar to the proof of Lemma 9 in Section 10
of [7] that for every $\theta \in \Lambda$ there is no loss of generality
in assuming that an $n$-block code $C_\theta^n = (f_\theta, \varphi_\theta)$ achieving
$L_\theta^n(\lambda)$ satisfies $\ell_n(f_\theta(x^n)) \le 2\rho_{\max}/\lambda$ for all $x^n \in \mathcal{X}^n$.
Thus, $g$ is bounded by $3\rho_{\max}$. A straightforward application of
Fubini's theorem and the definition of the $\beta$-mixing coefficient
yields

$$E_{\theta_0} g(X_1^n, X_{-m_n+1}^0) \le E_{\theta_0} L_{\theta_0}(C_{\hat\theta}^n, \lambda) + O(1/n^{2+\eta}),$$

where $\hat\theta = \hat\theta(X_{-m_n+1}^0)$. Thus, the Lagrangian performance
of the second-stage code is determined by the behavior of the
code $C_{\hat\theta}^n$ (which depends on $X_{-m_n+1}^0$). Because $\rho$ is a metric,
a basic Lagrangian mismatch argument (see, e.g., Lemma 9 in
Section 8 of [7]) shows that

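$$E_{\theta_0} L_{\theta_0}(C_{\hat\theta}^n, \lambda) \le L_{\theta_0}(C_{\theta_0}^n, \lambda) + 4\rho_{\max} E_{\theta_0} d_n(\theta_0, \hat\theta).$$

By blocking, the expectation of $d_n(\theta_0, \hat\theta)$ w.r.t. $P_{\theta_0}$ can be
approximated by expectation w.r.t. $\tilde{Q}^{(n)}$. Followed by an
application of the triangle inequality, this yields

$$E_{\theta_0} d_n(\theta_0, \hat\theta) \le E_{\tilde{Q}^{(n)}}\{d_n(\theta_0, \tilde\theta) + d_n(\tilde\theta, \hat\theta)\} + O(1/n^{1+\eta}),$$

where $\tilde\theta = \tilde\theta(X_{-m_n+1}^0)$ is the MD estimate of $\theta_0$. Now,
$d_n(\tilde\theta, \hat\theta) < \sqrt{n}\delta_n = O(\sqrt{V_n \log n/n})$ eventually almost
surely, by construction of the first-stage encoder. The expectation
$E_{\tilde{Q}^{(n)}} d_n(\theta_0, \tilde\theta)$ can be handled via (2) and the Vapnik-Chervonenkis
inequalities, yielding

$$E_{\theta_0} d_n(\theta_0, \hat\theta) = O(\sqrt{V_n \log n/n}) + O(1/n^{1+\eta}).$$

Thus, $E_{\theta_0}\{g\} = L_{\theta_0}^n(\lambda) + O(\sqrt{V_n \log n/n}) + O(1/n^{1+\eta})$.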

Step 6: the overall performance. Gathering together our estimates
for the first stage and for the second stage, we get

$$L_{\theta_0}(C_*^{n,m_n}, \lambda) = L_{\theta_0}^n(\lambda) + O(\sqrt{V_n \log n/n}) + O(k \log n/n) + O(\log\log n/n) + o(1)$$

for almost every realization of the database $\mathbf{C}$. As for the
performance of the scheme in identifying the active source, note
that, with our choice of $l_n$, the sequence $n\beta_{\theta_0}(l_n)$ is summable
in $n$. Then a straightforward application of the Borel-Cantelli
lemma and the Vapnik-Chervonenkis inequalities yields

$$d_n(\theta_0, \hat\theta(X_{-m_n+1}^0)) = O\big(\sqrt{V_n \log n/n}\big), \qquad P_{\theta_0}\text{-a.s.}$$

V. Examples 

Here, we present three examples of parametric families
satisfying the conditions of Theorem 1 and thus admitting
joint universal lossy coding and identification schemes. The
following result [5] will be used throughout: Let $\mathcal{C} = \{A_\xi :
\xi \in \mathbb{R}^k\}$ be a collection of measurable subsets of $\mathbb{R}^d$, such
that $A_\xi = \{z \in \mathbb{R}^d : \Pi(z, \xi) > 0\}$ for all $\xi$, where for each
$z \in \mathbb{R}^d$, $\Pi(z, \cdot)$ is a polynomial of degree $s$ in the components
of $\xi$. Then $\mathcal{C}$ is a VC class with $V(\mathcal{C}) \le 2k \log(4es)$.
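
Because the bound depends only on the number $k$ of components of $\xi$ and on the degree $s$, it can be instantiated mechanically. Plugging in the parameter counts and degrees that arise in the three examples of this section gives:

```latex
\begin{align*}
k = 6,\ s = 3 &: \quad V(\mathcal{C}) \le 2 \cdot 6 \,\log(4e \cdot 3) = 12 \log(12e), \\
k = 2p + 2,\ s = 2 &: \quad V(\mathcal{C}) \le 2(2p+2) \log(4e \cdot 2) = (4p+4)\log(8e), \\
k = 2M^2,\ s = n &: \quad V(\mathcal{C}) \le 2 \cdot 2M^2 \log(4en) = 4M^2 \log(4en).
\end{align*}
```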
Stationary memoryless sources. Let $\mathcal{X} = \mathbb{R}$, and let $\{P_\theta :
\theta \in \Lambda\}$ be the collection of all Gaussian i.i.d. processes with
mean $m \in \mathbb{R}$ and variance $\sigma^2 \in (0, \infty)$. Thus $\Lambda = \{(m, \sigma) :
m \in \mathbb{R}, 0 < \sigma < \infty\} \subset \mathbb{R}^2$. This class of sources trivially
satisfies Condition 1 with $r = +\infty$, and it remains to check
Conditions 2 and 3. To check Condition 2, consider the normalized
relative entropy (information divergence) $D_n(\theta\|\theta')$ between
$P_\theta^n$ and $P_{\theta'}^n$, with $\theta = (m, \sigma)$ and $\theta' = (m', \sigma')$ (which is
equal to $D_1(\theta\|\theta')$ because the sources are i.i.d.). It is not
hard to get the bound $D_n(\theta\|\theta') \le (1 + \sigma'/\sigma)^2 \|\theta - \theta'\|^2/2\sigma'^2$.
Now fix a small $\delta \in (0, \sigma)$ and suppose that $\|\theta - \theta'\| < \delta$.
Then $|\sigma - \sigma'| < \delta$, so we can further upper-bound $D_n(\theta\|\theta')$
as $D_n(\theta\|\theta') \le \frac{c_\theta^2}{2}\|\theta - \theta'\|^2$ for all $\theta'$ in the open ball of radius
$\delta$ around $\theta$, with $c_\theta = 3/(\sigma - \delta)$. Using Pinsker's inequality
[4], we have $d_n(\theta, \theta')/\sqrt{n} \le \sqrt{2D_n(\theta\|\theta')} \le c_\theta\|\theta - \theta'\|$ for
all $n$. Thus, Condition 2 holds. To check Condition 3 note
that, for each $n$, the Yatracos class $\mathcal{A}_n$ consists of all sets
of the form $\{x^n \in \mathbb{R}^n : \Pi(x^n, \theta, \theta') > 0\}$, $\theta, \theta' \in \Lambda$, where
for each $x^n \in \mathcal{X}^n$, $\Pi(x^n, \theta, \theta')$ is a third-degree polynomial
in $(\ln\sigma^2, \ln\sigma'^2, 1/\sigma^2, 1/\sigma'^2, m, m')$. Thus, $\mathcal{A}_n$ is a VC class
with $V(\mathcal{A}_n) \le 12\log(12e)$, satisfying Condition 3.
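
The relative-entropy bound used to verify Condition 2 can be sanity-checked numerically. The sketch below assumes the bound reads $D(\theta\|\theta') \le (1 + \sigma'/\sigma)^2\|\theta - \theta'\|^2/(2\sigma'^2)$ with $\theta = (m, \sigma)$ and $\sigma$ the standard deviation, and compares it with the closed-form relative entropy between two scalar Gaussians on a grid around a reference point. The grid, the reference point and the tolerance are arbitrary choices for the example.

```python
import math


def kl_gaussian(m, s, m2, s2):
    """Relative entropy D(N(m, s^2) || N(m2, s2^2)) in nats; s, s2 are std deviations."""
    return math.log(s2 / s) + (s * s + (m - m2) ** 2) / (2 * s2 * s2) - 0.5


def kl_upper_bound(m, s, m2, s2):
    """(1 + s2/s)^2 * ||theta - theta'||^2 / (2 * s2^2)."""
    sq_norm = (m - m2) ** 2 + (s - s2) ** 2
    return (1 + s2 / s) ** 2 * sq_norm / (2 * s2 * s2)


def check_grid(m=0.0, s=1.0, steps=20, radius=0.5):
    """True if the bound holds at every grid point (m2, s2) around (m, s)."""
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            m2 = m + radius * i / steps
            s2 = s + radius * j / steps
            if kl_gaussian(m, s, m2, s2) > kl_upper_bound(m, s, m2, s2) + 1e-12:
                return False
    return True
```

A grid check is of course not a proof; it only guards against a misread exponent or constant in the bound.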

Autoregressive (AR) sources. Let $\mathcal{X} = \mathbb{R}$ and let $X$ be a
Gaussian AR($p$) source. That is, there exist $p$ real parameters
$a_1, \ldots, a_p$, such that $X_n = -\sum_{i=1}^p a_i X_{n-i} + Y_n$ for all
$n$, where $Y = \{Y_i\}_{i \in \mathbb{Z}}$ is an i.i.d. Gaussian process with
zero mean and unit variance. Let $\Lambda \subset \mathbb{R}^p$ be the set of
all $(a_1, \ldots, a_p)$, such that all roots of the polynomial $A(z) =
\sum_{i=0}^p a_i z^i$, $a_0 = 1$, lie outside the unit circle in the complex
plane. Under these conditions, for each $\theta \in \Lambda$ the process
$X$ is exponentially $\beta$-mixing [11], i.e., there exists some
$\gamma = \gamma(\theta) \in (0,1)$, such that $\beta_\theta(k) = O(\gamma^k)$. Now, for any
fixed $r > 0$, $\gamma^k \le k^{-r}$ for $k$ sufficiently large, so Condition 1
holds. For Condition 2, it can be shown that, for each $\theta \in \Lambda$,
the asymptotic Fisher information matrix $I(\theta)$ exists (and is
nonsingular) [12]. Thus, Condition 2 can be met. To verify
Condition 3, consider the $n$-dimensional marginal $P_\theta^n$,
which has the normal density $p_\theta(x^n) = N(x^n; 0, R_n(\theta))$,
where $R_n(\theta)$ is the $n$th-order autocorrelation matrix of $X$. For
every $\theta \in \Lambda$, let $\bar\theta = (\theta, \ln\det R_n^{-1}(\theta))$. Since $\ln\det R_n^{-1}(\theta)$
is uniquely determined by $\theta$, we have $A_{\theta,\theta'} = A_{\bar\theta,\bar\theta'}$ for
all sets in the Yatracos class $\mathcal{A}_n$. This, and the fact that
the entries of $R_n^{-1}(\theta)$ are quadratic functions of $a_1, \ldots, a_p$,
implies that, for each $x^n$, the condition $x^n \in A_{\theta,\theta'}$ can be
expressed as $\Pi(x^n, \bar\theta, \bar\theta') > 0$, where $\Pi(x^n, \cdot)$ is quadratic
in the $2p + 2$ real variables $\bar\theta_1, \ldots, \bar\theta_{p+1}, \bar\theta'_1, \ldots, \bar\theta'_{p+1}$. Thus,
$V(\mathcal{A}_n) \le (4p + 4)\log(8e)$. Therefore, Condition 3 is met.
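
Membership in the parameter set can be tested directly from the roots of $A(z)$. The sketch below handles only the case $p = 2$, where the roots of $A(z) = 1 + a_1 z + a_2 z^2$ are available in closed form; the restriction to $p = 2$ and the function names are choices made for this illustration.

```python
import cmath


def ar2_roots(a1, a2):
    """Roots of A(z) = 1 + a1*z + a2*z^2 (requires a2 != 0)."""
    if a2 == 0:
        raise ValueError("a2 must be nonzero for a genuine AR(2) polynomial")
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    return (-a1 + disc) / (2 * a2), (-a1 - disc) / (2 * a2)


def in_parameter_set(a1, a2):
    """True if both roots of A(z) lie strictly outside the unit circle."""
    return all(abs(z) > 1 for z in ar2_roots(a1, a2))
```

For instance, $A(z) = (1 - 0.2z)(1 - 0.3z)$ has roots 5 and 10/3 and passes the test, while $A(z) = (1 - 0.5z)(1 - 2z)$ has a root at 0.5 and fails it.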

Hidden Markov processes. A hidden Markov process is
a discrete-time finite-state homogeneous Markov chain,
observed through a discrete-time memoryless channel (see [13]
and references therein). Let $S = \{S_i\}_{i \in \mathbb{Z}}$ be a stationary
ergodic Markov process with $M < \infty$ states and the (unique)
stationary distribution $\pi = (\pi_1, \ldots, \pi_M)$. Let $a_{ij} = P(S_{t+1} =
j | S_t = i)$, $1 \le i, j \le M$, denote the corresponding one-step
transition probabilities. Let $\mathcal{X} = \mathbb{R}^d$, and consider a discrete-time
memoryless channel with input alphabet $\mathcal{S} = \{1, \ldots, M\}$
and output alphabet $\mathcal{X}$, specified by a collection $\{p(\cdot|s) : s \in
\mathcal{S}\}$ of probability densities on $\mathbb{R}^d$ w.r.t. the Lebesgue measure.
The output process $X = \{X_i\}_{i \in \mathbb{Z}}$ is the source of interest.

Let us assume that the channel transition densities are
known, and that the one-step transition probabilities of the
underlying Markov chain $S$ are known to be strictly positive
and bounded from below by some $a_0 > 0$. Thus, our parameter
space is the set $\Lambda = \{\theta = [a_{ij}] \in \mathbb{R}^{M \times M} : a_{ij} > a_0, \forall i, j\}$.
Under these assumptions, for any $\theta \in \Lambda$ the underlying
Markov process $S$ is exponentially $\beta$-mixing [14]. It can also
be shown [5] that for every $\theta \in \Lambda$ there exists a measurable
map $F : \mathcal{S} \times [0,1] \to \mathcal{X}$, such that $X_i = F(S_i, U_i)$
for all $i \in \mathbb{Z}$, where $\{U_i\}$ are i.i.d. random variables with
uniform distribution on $[0,1]$, independent of $S$. The pair
process $\{(S_i, U_i)\}$ is exponentially $\beta$-mixing, and therefore so
is $X$. This establishes Condition 1. Under additional technical
assumptions on the densities $\{p(\cdot|s)\}$ it can be shown that
the asymptotic Fisher information matrix $I(\theta)$ exists for all
$\theta \in \Lambda$ [15], which implies that Condition 2 holds as well.
Finally, to show that Condition 3 is satisfied, note that the
$n$-dimensional marginal of $P_\theta$ for a given $\theta = [a_{ij}]$ has
the density

$$p_\theta(x^n) = \sum_{s^n \in \mathcal{S}^n} \prod_{i=1}^n a_{s_{i-1}s_i}\, p(x_i | s_i),$$

where $a_{s_0 s} = \pi_s$ for all $s$. Then it follows that the Yatracos class $\mathcal{A}_n$
consists of sets of the form $\{x^n \in \mathcal{X}^n : \Pi(x^n, \theta, \theta') > 0\}$,
$\theta = [a_{ij}], \theta' = [a'_{ij}] \in \Lambda$, where $\Pi(x^n, \cdot)$ is a polynomial of
degree $n$ in the $2M^2$ parameters $\{a_{ij}, a'_{ij}\}$. Thus, $V(\mathcal{A}_n) \le
4M^2\log(4en)$, so that Condition 3 holds as well.
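
The density of a hidden Markov process is a sum over all $M^n$ state sequences, but it can be evaluated in $O(nM^2)$ operations with the standard forward recursion. The sketch below computes it both ways so the two can be compared; the two-state chain and the unit-variance Gaussian emission densities are hypothetical choices for the example, with states indexed from 0 rather than 1.

```python
import math
from itertools import product


def hmm_density_bruteforce(xs, pi, a, emit):
    """Sum over all state sequences of pi[s_1] p(x_1|s_1) prod_i a[s_{i-1}][s_i] p(x_i|s_i)."""
    M = len(pi)
    total = 0.0
    for states in product(range(M), repeat=len(xs)):
        term = pi[states[0]] * emit(xs[0], states[0])
        for i in range(1, len(xs)):
            term *= a[states[i - 1]][states[i]] * emit(xs[i], states[i])
        total += term
    return total


def hmm_density_forward(xs, pi, a, emit):
    """Same density via the forward recursion, in O(n * M^2) operations."""
    M = len(pi)
    alpha = [pi[s] * emit(xs[0], s) for s in range(M)]
    for x in xs[1:]:
        alpha = [sum(alpha[r] * a[r][s] for r in range(M)) * emit(x, s) for s in range(M)]
    return sum(alpha)


def gauss_emit(x, s, means=(0.0, 3.0)):
    """Hypothetical channel: unit-variance Gaussian density centered at means[s]."""
    return math.exp(-((x - means[s]) ** 2) / 2) / math.sqrt(2 * math.pi)


A = [[0.9, 0.1], [0.2, 0.8]]   # one-step transition probabilities
PI = [2 / 3, 1 / 3]            # stationary distribution of A
```

The brute-force sum makes the polynomial structure in the entries $a_{ij}$ explicit (each term is a product of at most $n$ of them), which is the property the VC bound relies on.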